Make it easier for user to search for tags#542

Merged
ikuyarihS merged 9 commits into master from fuzzy-tag-search
Feb 7, 2020
Conversation

@ikuyarihS
Contributor

@ikuyarihS ikuyarihS commented Oct 18, 2019

Closes #231

This applies a needle-in-a-haystack algorithm to find and match a tag name among the existing tags, for example:

[Image: Example]

This only applies when the searched tag_name is more than 3 characters long and at least 80% of its letters are found in a tag, in order from left to right.

There are 3 levels of checking, stopping at the first level that finds a match:

  • Check for an exact name match (case-insensitive), an O(1) lookup in a dictionary Dict[str, Tag]
  • Check for all tags that have a 100% match via the algorithm
  • Check for all tags that have a >= 80% match
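
The three levels above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `ordered_letter_score` and `find_tags` are hypothetical names, the tags are assumed to live in a dictionary keyed by lower-case name, and the scoring function is a simplified left-to-right letter scan.

```python
from typing import Dict, List


def ordered_letter_score(search: str, target: str) -> float:
    """Percentage of letters in `search` found in `target`, scanning left to right."""
    needle = search.lower().replace(" ", "")
    haystack = target.lower()
    if not needle:
        return 0.0
    found, position = 0, 0
    for letter in needle:
        hit = haystack.find(letter, position)
        if hit == -1:
            break
        found += 1
        position = hit + 1  # keep scanning after the letter that just matched
    return found / len(needle) * 100


def find_tags(name: str, tags: Dict[str, str]) -> List[str]:
    """Return matching tag names, stopping at the first level that yields a hit."""
    key = name.lower()
    if key in tags:  # level 1: exact, case-insensitive dictionary lookup
        return [key]
    if len(name) <= 3:  # fuzzy matching only applies to names longer than 3
        return []
    scores = {tag: ordered_letter_score(name, tag) for tag in tags}
    for threshold in (100, 80):  # level 2 (100%), then level 3 (>= 80%)
        hits = [tag for tag, score in scores.items() if score >= threshold]
        if hits:
            return hits
    return []


# Assumed sample data: tag name -> tag content.
sample_tags = {"f-strings": "...", "ask": "...", "this-too-will-be-fun": "..."}
print(find_tags("fstrings", sample_tags))
```

Computing the scores once and then walking the thresholds keeps the second and third levels from rescoring every tag.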

If there is more than one hit, the hits are shown as suggestions:

[Image: Suggestions]

To avoid calling the API multiple times, I've implemented a cache that only refreshes itself when more than 5 minutes have passed since the last API call to get all tags.

Editing, adding, or deleting tags will also modify the cache directly.
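
A minimal sketch of such a time-based cache, assuming a `fetch_all` callable that stands in for the API call returning all tags. The class name, method names, and the injectable `clock` parameter are hypothetical and only there to make the idea testable; the actual cog may be structured differently.

```python
import time
from typing import Callable, Dict

REFRESH_SECONDS = 5 * 60  # refresh only after a gap of more than 5 minutes


class TagCache:
    """Keep a local copy of all tags and refetch only when it is stale."""

    def __init__(
        self,
        fetch_all: Callable[[], Dict[str, dict]],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch_all = fetch_all  # stand-in for the API call
        self._clock = clock
        self._tags: Dict[str, dict] = {}
        self._last_fetch = 0.0  # epoch 0, so the first lookup always fetches

    def get_all(self) -> Dict[str, dict]:
        now = self._clock()  # read the clock once per lookup
        if now - self._last_fetch > REFRESH_SECONDS:
            self._tags = dict(self._fetch_all())
            self._last_fetch = now
        return self._tags

    def set(self, name: str, tag: dict) -> None:
        """Mirror an add or edit in the local cache."""
        self._tags[name.lower()] = tag

    def delete(self, name: str) -> None:
        """Mirror a delete in the local cache."""
        self._tags.pop(name.lower(), None)
```

The local `set` and `delete` calls only keep the cache in step with changes made through the bot; anything edited elsewhere would not show up until the next refresh.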

What about other solutions like fuzzywuzzy?

fuzzywuzzy was considered, but in testing it gave much lower scores than expected:

Code used to test:

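from typing import Tuple

from fuzzywuzzy import fuzz

def _fuzzy_search(search: str, target: str) -> Tuple[float, int, int]:
    """Return the custom score alongside fuzz.ratio and fuzz.partial_ratio for comparison."""
    found = 0
    index = 0
    # Only spaces are stripped; hyphens in the target stay in place.
    _search = search.lower().replace(' ', '')
    _target = target.lower().replace(' ', '')
    for letter in _search:
        # Look for each letter at or after the position of the previous match.
        index = _target.find(letter, index)
        if index == -1:
            break
        # Note: a match at index 0 is not counted by this test script.
        found += index > 0
    return (
        found / len(_search) * 100,
        fuzz.ratio(search, target),
        fuzz.partial_ratio(search, target)
    )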

tests = (
    'this-is-gonna-be-fun',
    'this-too-will-be-fun'
)

for test in tests:
    print(test, '->', _fuzzy_search('this too fun', test))

Result from test:

this-is-gonna-be-fun -> (30.0, 50, 50)
this-too-will-be-fun -> (90.0, 62, 58)

@ikuyarihS ikuyarihS added t: feature New feature or request area: cogs p: 3 - low Low Priority labels Oct 18, 2019
@ikuyarihS ikuyarihS requested review from SebastiaanZ and sco1 October 18, 2019 03:41
@ikuyarihS ikuyarihS self-assigned this Oct 18, 2019
@kosayoda
Contributor

Looking at the fuzzy search, have you considered the built-in difflib module?
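
For readers unfamiliar with it, a minimal sketch of what a difflib-based lookup could look like, using the standard library's `difflib.get_close_matches`. The tag names are borrowed from examples elsewhere in this thread, and the helper name and the 0.8 cutoff are assumptions for illustration.

```python
import difflib
from typing import List

# Tag names borrowed from this thread; a real lookup would use the full tag list.
TAG_NAMES = ["f-strings", "ask", "foo", "open", "scope"]


def suggest_with_difflib(name: str, cutoff: float = 0.8) -> List[str]:
    """Return up to three tag names whose similarity ratio to `name` is at least `cutoff`."""
    return difflib.get_close_matches(name.lower(), TAG_NAMES, n=3, cutoff=cutoff)


print(suggest_with_difflib("fstrings"))  # ['f-strings']
```

`get_close_matches` ranks candidates by `SequenceMatcher.ratio()`, so it scores whole-string similarity rather than ordered letters of the search term.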

@MarkKoz
Contributor

MarkKoz commented Oct 24, 2019

We actually discussed if it'd be better to do the fuzzy search server-side on the API. I haven't looked into it deeply but here are some relevant links:

https://docs.djangoproject.com/en/2.2/ref/contrib/postgres/search/
https://github.com/vsemionov/django-rest-fuzzysearch

I'm not sure if it'd be better to do it server or client side. I think that if there is room for fuzzy search to be used in the future with other endpoints (new or existing), then it should be server side. Another factor would be to see how accurate the pg search features are for our needs here.

@ikuyarihS
Contributor Author

Looking at the fuzzy search, have you considered the built-in difflib module?

I've taken a look at it; it proves quite useful for getting the differences between before and after in the on_message_edit event. I've also looked at its SequenceMatcher, which gives results similar to fuzzywuzzy:

...
s = difflib.SequenceMatcher(lambda x: x in ' -', search, target)
return (
    found / len(_search) * 100,
    ('fuzzy', fuzz.ratio(search, target), fuzz.partial_ratio(search, target)),
    ('difflib', tuple(map(lambda x: x * 100, (s.ratio(), s.real_quick_ratio(), s.quick_ratio()))))
)

# --------------------------------
this too fun & this-is-gonna-be-fun -> (30.0, ('fuzzy', 50, 50), ('difflib', (50.0, 75.0, 50.0)))
this too fun & this-too-will-be-fun -> (90.0, ('fuzzy', 62, 58), ('difflib', (62.5, 75.0, 62.5)))

I've thought about whether this should be done in the API or in the bot. I think having a cache on the bot will give better performance, especially if we do not modify tags from the site side and restrict tag modification to the bot's commands only; then we can maintain a cache that's perfectly synced with the site.

The Postgres search feature looks powerful too; I'll definitely want to see how it performs as well.

@SebastiaanZ
Contributor

Did we decide on an approach for this? Bot-side or server-side? I kinda like the idea of shipping a query off to the API and having postgres do its thing.

@MarkKoz
Contributor

MarkKoz commented Nov 5, 2019

Given #388 it's better to keep it client-side as it would eventually have to be client-side anyway. However, that issue is stale so I don't know if we still want to do that. If not, then I agree with doing it on the server-side.

@SebastiaanZ
Contributor

That's a good point; I'd forgotten about that. Let's ask our tag master, @fiskenslakt, what his current opinion on the matter is to make sure we get this thing moving again.

@scragly
Contributor

scragly commented Nov 15, 2019

I closed #388. The meta repo now contains markdown files of any of our tags for now for the public to read through and be able to submit PRs for adding or editing tags.

At the moment the process of adding or editing the tags is done via bot command in-server or by using the site's admin page (mods+ now have full access to the tag admin page).

There are improvements that can be made to make things easier and to automate/integrate the process, but I'm of the opinion that tags will continue to live in the database, be accessible via the API and editable via web admin, and as such we should probably stick to doing fuzzy matching API-side.

@scragly scragly added a: API Related to or causes API changes a: information Related to information commands: (doc, help, information, reddit, site, tags) s: stalled Something is blocking further progress type: Enhancement and removed t: feature New feature or request labels Nov 15, 2019
@lemonsaurus lemonsaurus added t: feature New feature or request and removed type: enhancement labels Dec 15, 2019
@jb3 jb3 requested a review from a team as a code owner February 2, 2020 22:52
MarkKoz
MarkKoz previously requested changes Feb 3, 2020
Contributor

@MarkKoz MarkKoz left a comment

Since this is already done and there has been no progress re-implementing this on the site, I think it is best to get this PR merged for the time being.

@MarkKoz MarkKoz added s: waiting for author Waiting for author to address a review or respond to a comment and removed s: stalled Something is blocking further progress a: API Related to or causes API changes labels Feb 3, 2020
- Changed type of `self._last_fetch` to `float` and give it the initial value of `0.0` instead of `None`
- Assigned `time.time()` to `time_now` to avoid calling this function twice.
- Added `self._last_fetch = time_now` after calling the api call.
…ciency.

- Matching scores will be calculated once now and stored in the dict `scores`.
- Allow `_get_suggestions()` to go through a list of score thresholds and return the first non-empty list of tags that match at or above a threshold. This avoids calling the function multiple times like before (for example, `self._get_suggestions(tag_name, 100) or self._get_suggestions(tag_name, 80)` calls the function twice and is inefficient).
- Deleted commented line.
- Added `typing` module for more typehints.
@ikuyarihS ikuyarihS added status: needs review and removed s: waiting for author Waiting for author to address a review or respond to a comment labels Feb 4, 2020
@MarkKoz MarkKoz dismissed their stale review February 4, 2020 17:59

Addressed

Contributor

@MarkKoz MarkKoz left a comment

For a tag named foo-bar, foobars will not match and neither will foo_bar. foobar does match. The tags command doesn't seem to like spaces in tag names - it will never match.

- Added a regex to remove non-alphabet ( `[^a-z]` with `re.IGNORECASE` )
… 60]

- Since it returns as soon as suggestions are found for a threshold, this gives a better reflection of what the bot thinks the user is searching for.
@ikuyarihS
Contributor Author

Interesting! I've added a regex to remove all non-alphabetic characters, and extended the thresholds from [100, 80] to [100, 90, 80, 70, 60]. Since the search stops as soon as suggestions are found for a threshold, this gives suggestions that better reflect what the bot thinks the user is searching for. This fixed both foo_bar and foobars when searching for foo-bar.
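
To illustrate the normalisation step on its own, here is a small sketch of stripping non-letters before comparing names. The `normalise` helper is a hypothetical name, not a function from this PR.

```python
import re

# Matches anything that is not a letter; IGNORECASE covers upper-case input.
NON_LETTERS = re.compile(r"[^a-z]", re.IGNORECASE)


def normalise(name: str) -> str:
    """Drop every non-letter and lower-case the rest."""
    return NON_LETTERS.sub("", name).lower()


# Separators no longer matter once both sides are normalised.
print(normalise("foo_bar") == normalise("foo-bar"))  # True
```

With separators removed, foo_bar and foo-bar compare equal, while foobars differs only by its trailing letter and can still be picked up by one of the lower thresholds.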

@MarkKoz
Contributor

MarkKoz commented Feb 4, 2020

In some cases that still isn't working so well:

  • asks returns args-kwargs instead of ask
  • foos returns off-topic and functions-are-objects instead of foo
  • dict returns iterate-dict without considering dictcomps too
  • opens returns both scope and open when the latter is obviously a much better match

Also discovered an unrelated issue in which it can't handle DELETE or GET requests for tags with spaces in them (returns 404). Might be a URL encoding issue since the tag is part of the URL path. It can POST fine because the tag name is instead part of the JSON.
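
If URL encoding does turn out to be the cause, percent-encoding the tag name before placing it in the path is one possible fix. The sketch below uses the standard library's `urllib.parse.quote`; the `bot/tags/` path prefix and the `tag_path` helper are assumptions for illustration, not the confirmed endpoint or client code.

```python
from urllib.parse import quote


def tag_path(tag_name: str) -> str:
    """Build a URL path segment for a tag, percent-encoding unsafe characters."""
    # safe="" also encodes "/", so a tag name can never add extra path segments.
    return f"bot/tags/{quote(tag_name, safe='')}"


print(tag_path("my tag"))  # bot/tags/my%20tag
```

POST requests would be unaffected either way, since the tag name travels in the JSON body rather than in the path.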

@ikuyarihS
Contributor Author

ikuyarihS commented Feb 5, 2020

Hmm, I've added some more complexity that forces the search to go word by word. Here's the snippet I used to test:

import re
from typing import Dict, List, Optional

REGEX_NON_ALPHABET = re.compile(r"[^a-z]", re.MULTILINE | re.IGNORECASE)

stuff = ['args-kwargs', 'ask', 'class', 'classmethod', 'codeblock', 'decorators', 'dictcomps', 'enumerate', 'except', 'exit()', 'f-strings', 'foo', 'functions-are-objects', 'global', 'if-name-main', 'indent', 'inline', 'iterate-dict', 'listcomps', 'mutable-default-args', 'names', 'no-dm',
         'off-topic', 'open', 'or-gotcha', 'param-arg', 'paste', 'pathlib', 'pep8', 'positional-keyword', 'precedence', 'quotes', 'relative-path', 'repl', 'return', 'round', 'scope', 'seek', 'self', 'star-imports', 'traceback', 'windows-path', 'with', 'xy-problem', 'ytdl', 'zen', 'zip', ]

_cache = dict(zip(stuff, stuff))


def _fuzzy_search(search: str, target: str) -> float:
    """A simple scoring algorithm based on how many letters are found / total, with order in mind."""
    current, index = 0, 0
    _search = REGEX_NON_ALPHABET.sub('', search.lower())
    _targets = iter(REGEX_NON_ALPHABET.split(target.lower()))
    _target = next(_targets)
    try:
        while True:
            while index < len(_target) and _search[current] == _target[index]:
                current += 1
                index += 1
            index, _target = 0, next(_targets)
    except (StopIteration, IndexError):
        pass
    return current / len(_search) * 100


def _get_suggestions(tag_name: str, thresholds: Optional[List[int]] = None) -> str:
    """Return a formatted string of suggested tags, or a "not found" message."""
    scores: Dict[str, float] = {
        tag_title: _fuzzy_search(tag_name, tag)
        for tag_title, tag in _cache.items()
    }

    thresholds = thresholds or [100, 90, 80, 70, 60]

    for threshold in thresholds:
        suggestions = [
            _cache[tag_title]
            for tag_title, matching_score in scores.items()
            if matching_score >= threshold
        ]
        if suggestions:
            return f"{repr(tag_name)} - {suggestions}"

    return f"{repr(tag_name)} not found"


print(_get_suggestions('fstring'))
print(_get_suggestions('fstrings'))
print(_get_suggestions('fstr'))
print(_get_suggestions('f-str'))
print(_get_suggestions('f-string'))
print(_get_suggestions('f-strings'))
print(_get_suggestions('asks'))
print(_get_suggestions('foos'))
print(_get_suggestions('dict'))
print(_get_suggestions('opens'))
print(_get_suggestions('or'))
print(_get_suggestions('or-g'))
print(_get_suggestions('or-'))
print(_get_suggestions('got'))
print(_get_suggestions('path'))
print(_get_suggestions('main'))
print(_get_suggestions('if'))
print(_get_suggestions('if main'))
print(_get_suggestions('asdfasdf'))

Here are the results:

'fstring' - ['f-strings']
'fstrings' - ['f-strings']
'fstr' - ['f-strings']
'f-str' - ['f-strings']
'f-string' - ['f-strings']
'f-strings' - ['f-strings']
'asks' - ['ask']
'foos' - ['foo']
'dict' - ['dictcomps', 'iterate-dict']
'opens' - ['open']
'or' - ['or-gotcha']
'or-g' - ['or-gotcha']
'or-' - ['or-gotcha']
'got' - ['or-gotcha']
'path' - ['pathlib', 'relative-path', 'windows-path']
'main' - ['if-name-main']
'if' - ['if-name-main']
'if main' - ['if-name-main']
'asdfasdf' not found

- Added regex back to sub and split by non-alphabet.
- Now uses two pointers to move from word to word.
Contributor

@MarkKoz MarkKoz left a comment

That's working much better.

Contributor

@Akarys42 Akarys42 left a comment

Works great!

@ikuyarihS ikuyarihS merged commit e205bf6 into master Feb 7, 2020
@ikuyarihS ikuyarihS deleted the fuzzy-tag-search branch February 7, 2020 08:27

Labels

a: information Related to information commands: (doc, help, information, reddit, site, tags) p: 3 - low Low Priority t: feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tag fuzzy matching and aliasing

8 participants